from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<style>
.output_png {
display: table-cell;
text-align: center;
vertical-align: middle;
margin:auto;
}
</style>
<form action="javascript:code_toggle()"><input type="submit"
value="Click here to toggle on/off the raw code."></form>''')
import warnings  # imported explicitly in case utilities does not re-export it
from utilities import *
warnings.filterwarnings("ignore")
%matplotlib inline
On May 9, 2022, the Philippines will hold a national election to conclude President Duterte's six-year term and elect a new president who will lead the country to recovery. The country has 97 presidential candidates as of November 15, 2021, the deadline for filing candidacies.[1] Only a few presidential candidates have been active on social media, which appears to be the main source of information and news for Filipinos nowadays.[2]
As campaigns ramp up, the voting populace is eager to learn more about the candidates’ platforms and political positions in order to choose the right candidate for them. By analyzing and identifying patterns in the candidates’ social media engagement, the team hoped to glean insights into which topics or issues resonated with their followers. Furthermore, as the rivalry heats up, the candidates fight to attract supporters by ensuring maximum visibility and engagement.
Twitter reaches 10.2 million Filipino adults, or 6.6% of those eligible to vote, the majority of whom are between the ages of 18 and 34.[3] Because these Gen Zs are known to value communication and transparency,[4] it is crucial to provide them with information on topics that are relevant to them. Moreover, as the cradle of ‘woke’ and cancel culture,[5] getting a glimpse of the issues that resonated with them can be beneficial to the presidential candidates.
Twitter, as the country's fourth most-used social media platform,[6] can also be used to gauge public interest in the candidates. Perhaps it can even be used to reshape their followers' perceptions and counter confirmation bias.
Figure 1 shows an overview of the pipeline employed in this study.
1. Data Extraction
The Twitter API was used to collect data from the user accounts of the five most popular presidentiables on the platform.
| Presidentiable | Twitter URL | Date of First Tweet | Date of Last Tweet | Total Number of Tweets |
|---|---|---|---|---|
| Leni Robredo | https://twitter.com/lenirobredo | Mar 2020 | Nov 2021 | 824 |
| Bongbong Marcos | https://twitter.com/bongbongmarcos | Aug 2019 | Nov 2021 | 850 |
| Isko Moreno | https://twitter.com/IskoMoreno | Aug 2020 | Nov 2021 | 723 |
| Ping Lacson | https://twitter.com/iampinglacson | Jul 2016 | Nov 2021 | 846 |
| Bong Go | https://twitter.com/SAPBongGo | Jan 2019 | Nov 2021 | 781 |
The team also collected data from two other presidential candidates visible on Twitter:
a. Manny Pacquiao: https://twitter.com/MannyPacquiao; excluded from the analyses because his tweets focus on promoting his boxing matches.
b. Ka Leody: https://twitter.com/leodymanggagawa; excluded from the analyses because he only tweeted 18 times, all with low engagement.
In addition to the text of each tweet (the first tweet in the case of threads), the team also retrieved the following engagement data for each username:
a. Likes count
b. Retweet count
c. Quote-Tweet count
d. Replies count
2. Data Cleaning and Preprocessing
Before the model can recognize and interpret human language, several preprocessing steps occur in the backend. The features to be analyzed are text tweets from the presidentiables' official user accounts, and text data are often messy. To prepare the data for modeling and analysis, the following steps were followed:
a. Converting accented characters to their base form
Accented characters carry diacritics that affect pronunciation or meaning; converting them to their base ASCII form ensures that variants of the same word are counted together. Résumé, café, divorcé, coördinate, and exposé are a few examples.
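Since the exact implementation is not shown here, the following is a minimal sketch of this step using Python's standard `unicodedata` module (the function name is illustrative):

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Convert accented characters to their closest ASCII base form."""
    # NFKD decomposition splits each accented character into a base
    # character plus combining marks; the combining marks are dropped.
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Résumé at the café'))  # -> Resume at the cafe
```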
b. Lemmatization
Lemmatization is the process of reducing words to their base or dictionary form (the lemma).
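As a purely illustrative sketch of the idea (a real pipeline would typically use a library lemmatizer such as NLTK's WordNetLemmatizer), here is a toy lemmatizer with a small lookup table for irregular forms plus naive suffix rules; the words and rules are hypothetical:

```python
# Toy lemmatizer: a lookup table for irregular forms plus naive suffix
# rules. Purely illustrative; a real pipeline would use e.g. NLTK's
# WordNetLemmatizer instead.
LEMMA_LOOKUP = {'went': 'go', 'better': 'good', 'frontliners': 'frontliner'}

def lemmatize(word: str) -> str:
    word = word.lower()
    if word in LEMMA_LOOKUP:
        return LEMMA_LOOKUP[word]
    for suffix, replacement in (('ies', 'y'), ('ing', ''), ('ed', ''), ('s', '')):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word

print([lemmatize(w) for w in ['tweets', 'routes', 'went']])  # -> ['tweet', 'route', 'go']
```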
c. Converting to lowercase
The lower function was used to convert the characters or words to lowercase.
d. Removing punctuation, numbers and special characters
Special characters are non-alphanumeric characters that add no value to text comprehension and may introduce noise into algorithms. Regular expressions (RegEx) were used to eliminate these characters, along with punctuation and numbers.
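A minimal sketch of this step with Python's built-in `re` module, assuming URLs are also treated as noise (the function name is illustrative; this version also collapses the extra whitespace addressed in step f):

```python
import re

def remove_noise(text: str) -> str:
    """Strip punctuation, numbers, and special characters from a tweet."""
    text = re.sub(r'http\S+', ' ', text)      # URLs treated as noise (assumption)
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)  # keep only letters and whitespace
    return re.sub(r'\s+', ' ', text).strip()  # collapse extra whitespace/tabs

print(remove_noise('COVID-19 update @ 5pm!!! #StaySafe'))  # -> COVID update pm StaySafe
```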
e. Removing stop words
Filtering out uninformative tokens is a common preprocessing step. Stop words carry little meaning in natural language processing and are ignored in analysis. They were filtered out using the stop-word lists from the NLTK and advertools modules.
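A minimal sketch of the filtering step; the tiny stop-word set below (a few English and Filipino words) is illustrative only, whereas the study used the much larger NLTK English and advertools Filipino lists:

```python
# Illustrative stop-word set; the study used the full NLTK English list
# plus Filipino stop words from advertools.
STOP_WORDS = {'the', 'a', 'is', 'to', 'and', 'ng', 'sa', 'ang', 'po'}

def remove_stop_words(text: str) -> str:
    return ' '.join(w for w in text.split() if w.lower() not in STOP_WORDS)

print(remove_stop_words('the jab is free'))  # -> jab free
```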
f. Removing extra whitespaces and tabs
Extra whitespaces and tabs were also removed because they provide no value to text processing.
g. Removing emojis
Emojis are small digital images or icons widely used in messaging and social media. Because emojis were scarce in the collected tweets, they were removed and excluded from this study.
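A sketch of emoji removal using a regular expression over common emoji code-point ranges (the ranges below are illustrative, not exhaustive):

```python
import re

# Common emoji code-point ranges; illustrative rather than exhaustive.
EMOJI_PATTERN = re.compile(
    '[\U0001F300-\U0001FAFF'   # symbols, pictographs, emoticons
    '\u2600-\u27BF'            # miscellaneous symbols and dingbats
    '\uFE0F'                   # variation selector used after many emojis
    '\U0001F1E6-\U0001F1FF]'   # regional indicator (flag) symbols
)

def remove_emojis(text: str) -> str:
    return EMOJI_PATTERN.sub('', text)

print(remove_emojis('Thank you, frontliners! 🙏'))
```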
3. Information Retrieval
The Term Frequency–Inverse Document Frequency (TF-IDF) featurization method was used to determine the relevance of words across the entire set of tweets. The corpus is converted to a sparse matrix of TF-IDF scores, wherein rows represent documents (tweets) and columns represent terms, matching the indices in the formula below. TF-IDF assigns a weight for term $j$ in document $i$ as follows:
$$ w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_j}\right) $$
where $tf_{i,j}$ is the frequency of term $j$ in document $i$, $N$ is the total number of documents, and $df_j$ is the number of documents containing term $j$.
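To make the weighting concrete, here is a small pure-Python computation of this scheme (production implementations such as scikit-learn's TfidfVectorizer additionally smooth and normalize the weights):

```python
import math

def tfidf(docs):
    """w_{i,j} = tf_{i,j} * log(N / df_j) for a list of tokenized documents."""
    N = len(docs)
    vocab = {t for doc in docs for t in doc}
    df = {t: sum(t in doc for doc in docs) for t in vocab}  # document frequency
    # One weight dict per document; terms appearing in every document get 0.
    return [{t: doc.count(t) * math.log(N / df[t]) for t in set(doc)}
            for doc in docs]

docs = [['salamat', 'frontliners'], ['salamat', 'manila'], ['manila', 'covid']]
weights = tfidf(docs)
print(weights[0])  # the rarer 'frontliners' outweighs the more common 'salamat'
```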
The TF-IDF sparse matrix representations served as the bases for the two sets of models created in this project: binary classification and multi-class classification.
4. Exploratory Data Analysis (EDA)
EDA was performed on the cleaned and preprocessed data to obtain insights into the overall behavior of the datasets. The methods ranged from basic descriptive statistics to data visualization using line charts, boxplots, and word clouds.
5. Model Creation
The team created two sets of models to fulfill the project objective:
a. Five binary classification models
These models were used to identify the features that have high and low levels of engagement for each candidate, and included the following steps:
a. Binning the tweets into least and most engaged using the quantile-based discretization function qcut.
b. Implementing the seven ML classification models below, using the engagement level (high or low) as the target variable and the words/texts of the tweets as the features.
Decision Tree
Random Forest
Gradient Boosting Method
Logistic Regression (L1)
Linear SVM (L1)
Logistic Regression (L2)
Linear SVM (L2)
c. Improving the models by tuning their hyperparameters.
d. Using the accuracy metric to evaluate the models and determine the best model and its corresponding predictors.
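For two bins, the binning in step a amounts to splitting at the median. The pure-Python sketch below mirrors what `pandas.qcut(counts, q=2)` does (ignoring tie-handling details); the like counts are illustrative:

```python
from statistics import median

def bin_engagement(counts):
    """Label each tweet 'low' or 'high' by splitting at the median,
    mirroring pandas.qcut(counts, q=2) for two quantile bins
    (tie handling aside)."""
    cut = median(counts)
    return ['high' if c > cut else 'low' for c in counts]

likes = [12, 250, 8, 1900, 75, 430]  # illustrative like counts
print(bin_engagement(likes))  # -> ['low', 'high', 'low', 'high', 'low', 'high']
```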
b. Two multi-class classification models
The team explored two models to predict which candidate is most likely to post a given tweet.
The models utilized all of the data collected from the selected presidential candidates, and were evaluated using the accuracy scores and confusion matrix.
The models differ in the following:
Model 2a: included both text and numerical data; the target variable was the candidate, and the features were the words or text of the tweets, along with engagement metrics.
Model 2b: included all text data only, wherein the target variable was the candidate, and the features were the words or text of the tweets.
6. Interpretability
Model interpretability methods were integrated into this study to ensure that the model results are intuitive to the target stakeholders. This was also critical to the team’s understanding of the feature words that drive the models' low and high predictions.
a. SHAP
The team implemented SHapley Additive exPlanations (SHAP) using Logistic Regression as the base model to identify the words or topics that are important to each binary classification model that represents each candidate. Linear Explainer was utilized because the base is a linear model.
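For a linear model with (assumed) independent features, SHAP values have a closed form, $\phi_j = w_j (x_j - E[x_j])$, which is essentially what the Linear Explainer computes under the interventional assumption. A minimal sketch with hypothetical coefficients and TF-IDF values:

```python
def linear_shap(weights, x, background_means):
    """SHAP values for a linear model with independent features:
    phi_j = w_j * (x_j - E[x_j]). Summed, the phis give the model's
    raw output for x minus the expected output over the background."""
    return [w * (xj - mu) for w, xj, mu in zip(weights, x, background_means)]

coefs = [0.8, -0.5, 0.3]   # hypothetical logistic-regression coefficients
x = [1.0, 0.0, 2.0]        # TF-IDF values of one tweet (illustrative)
means = [0.2, 0.1, 0.5]    # mean TF-IDF values over the background data
print(linear_shap(coefs, x, means))
```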
b. LIME
The Local Interpretable Model-agnostic Explanations (LIME) technique was applied to the multi-class classification models to determine the likelihood that a specific candidate would post an input string of words, or synthetic tweets as the team coined them.
7. Insights Generation
The interpretability methods extracted the important feature words from the two sets of models. The team then created visualizations to interpret the results and generate insights.
a. SHAP Beeswarm Plots
These are information-dense summaries of how the top feature words in tweet samples impact the models' prediction. This type of visualization was primarily used for binary classification models to determine which topics or issues resonated the most and least with the presidentiables' followers.
b. Word Clouds
This type of visualization was used in conjunction with the SHAP beeswarm plots. These are graphical representations of the feature words' SHAP values multiplied by 100 for display purposes.
c. LIME Visualization
The LIME charts were utilized in the second set of models to display the contribution of each feature word to the prediction of the candidate most likely to have posted the input/synthetic tweets. This enabled the team to determine which words had the biggest impact on the prediction.
d. Screenshots of the Actual Tweets
To complete the storytelling, the team visited the candidates’ Twitter pages and queried the low and high-scoring feature words identified from the above visualizations. This step enabled the team to gain context and understand the behaviors of the candidates’ Twitter followers.
As briefly discussed in the previous section, the Presidential Candidates' Tweets dataset was constructed using the Twitter API (please see the separate notebook 'Supplementary Notebook- Using Twitter API' for the Python implementation). For this project, the team decided to store the dataset in CSV format for easy processing. The table below describes all the fields in the collected data:
| Field | Type | Description |
|---|---|---|
| id | int | Unique ID of the tweet |
| created_at | date | Date and time the tweet was posted on the platform |
| text | string | Text content of the tweet, including links to media and other sites |
| public_metrics.retweet_count | int | Total number of retweets the tweet obtained |
| public_metrics.reply_count | int | Total number of replies the tweet obtained |
| public_metrics.like_count | int | Total number of likes the tweet obtained |
| public_metrics.quote_count | int | Total number of quote tweets (retweets with caption) the tweet obtained |
| username | string | The username of the tweet creator |
df = pd.read_csv('candidate_tweetsv2.csv')
df = df[~df.username.isin(['MannyPacquiao', 'LeodyManggagawa'])]
df.head()
We perform data exploration on the collected data. The sections below illustrate the count of candidate tweets collected.
# plot count of tweets per candidate
tweet_count(df)
We also present below the tweets over time per candidate, including the earliest and latest tweet available in our dataset. The count and relevant quartiles are also presented.
tweet_distribution(df, 'IskoMoreno')
tweet_distribution(df, 'SAPBongGo')
tweet_distribution(df, 'bongbongmarcos')
tweet_distribution(df, 'iampinglacson')
tweet_distribution(df, 'lenirobredo')
A word cloud can be used to visualize the distribution and characteristics of words, which informed preprocessing choices such as TF-IDF, lemmatization, and filtering of the most and least common words.
word_cloud_candidate(df)
The team created a word cloud for each candidate and observed the following:
a. Words such as Manila and MANILA must be lowercased so they are counted as a single word.
b. NLTK stop words and Filipino stop words from advertools should be applied to remove low-level information such as about, without, etc., giving more importance to the relevant terms.
To maximize the accuracy of our models, the team performed hyperparameter tuning on seven different models, both error-based and information-based. The team used a stratified train-test split with 25% of the dataset as the test set, over 10 trials. Accuracy was chosen as the metric for selecting the best model and hyperparameters since the classes in these models are balanced.
for cand_name in df.username.unique():
display(HTML(f'<b>{cand_name}</b>'))
X, y, tfidf_vectorizer = get_feat_targ(df, cand_name, qcut=2, ngram_start=1)
display(train(X, y))
i = 0
shap_wordcloud(models[i], candidates[i], df, cols_exclude[i],
'PM REVERSE ROUTES', 'medical frontliners')
Leni Robredo’s tweets with words like frontliners, thank, and today are more likely to get higher engagement from her followers. These are indicative of tweets of appreciation for frontliners during the pandemic, as well as daily updates of her activities, usually accompanied by photos. However, her tweets with words like route, arrived, and stop are indicative of PSA-like tweets about the free shuttle service of the OVP, and are often met with low engagement.
i = 1
shap_wordcloud(models[i], candidates[i], df, cols_exclude[i],
'vlog', 'birthday')
Looking at Bongbong Marcos’s tweets, those with words like happy and birthday, which are obviously indicative of personal birthday wishes, are more likely to get higher engagement from his followers. Likewise with words like tulong and pandemya, which are indicative of tweets in Filipino.
On the other hand, those with words like vlog, which are obviously promotions for his personal vlog, are met with low engagement, as well as those with words like health and covid, which are indicative of his tweets in English.
i = 2
shap_wordcloud(models[i], candidates[i], df, cols_exclude[i],
'coronavirus', 'maynila')
Isko Moreno, as the mayor of the City of Manila, frequently posts about the interests of his city. His high-engagement tweets usually include words like maynila, salamat, and biliskilos, which are indicative of posts in Filipino. These are also usually accompanied by photos, frequently with him in it. On the other hand, his low-engagement tweets usually include words like covid, monitoring, latest, and manila, which are indicative of PSA-like tweets about the COVID situation in Manila that are posted in English.
i = 3
shap_wordcloud(models[i], candidates[i], df, cols_exclude[i],
'bong', 'biktima')
Bong Go was included in the analysis as he was still in the presidential race when this analysis was performed. His is a particularly interesting case as most of his tweets are in Filipino, which did not make a particularly good predictor for the level of engagement his tweet received. However, we noticed that his tweets about Rodrigo Duterte, denoted by the words pangulong duterte, frequently received high engagement, while promotional tweets about himself, denoted by the words kuya, bong, or kuya bong, frequently received low engagement.
i = 4
shap_wordcloud(models[i], candidates[i], df, cols_exclude[i],
'inquirerdotnet', 'country')
Lastly, most of Ping Lacson's tweets express dissent against the government. However, a key difference between his high- and low-engagement tweets is that the former are typically one- to two-sentence tweets that appeal to his followers' emotions, as denoted by words like people, country, leader, and good, while the latter are frequently technical in nature, as denoted by words like senator, budget, and bill.
The second set of models the group trained are multi-class classification models where the target variable is the candidate and the features are the texts of tweets. We discuss the results below.
For the multi-class classification models, the team performed hyperparameter tuning by selecting the hyperparameters that maximize accuracy. Accuracy was chosen for these models since, ideally, the model should correctly predict which of the five candidates tweeted a given statement. The team used a stratified train-test split with 25% of the dataset as the test set, over 10 trials.
mc_df = df.drop('Unnamed: 0', axis=1)
X, y = multi_class_preprocessing(mc_df)
multi_class_model(X, y)
Based on the results above, we choose Logistic Regression with L2 regularization and hyperparameter C=1, since this model provides the highest test accuracy at 72.7634%.
Using the selected model for the multi-class classification problem, we present below the drivers of its predictions, using SHAP for global interpretability and LIME for local interpretability. The next few sections of code illustrate this.
multi_class_viz(X, y, mc_df, 0)
Based on the SHAP plot above, we see that if a tweet contains 'manila', 'maynila', or other related text or hashtags, it is highly likely that Isko Moreno tweeted it. If the tweet contains 'bong' or 'kuya', then it is highly unlikely that Isko Moreno tweeted it. Other words with dominant red dots on the right side of the plot likewise indicate a high likelihood of Isko Moreno tweeting the text.
multi_class_viz(X, y, mc_df, 1)
Based on the SHAP plot above, we see that if a tweet contains 'kuya', 'kuya bong', 'bong', 'duterte', or 'serbisyo', it is highly likely that Bong Go tweeted it. If the tweet contains 'covid' or 'salamat', etc., then it is highly unlikely that Bong Go tweeted it. Other words with dominant red dots on the left side of the plot likewise indicate a low likelihood of Bong Go tweeting the text.
multi_class_viz(X, y, mc_df, 2)
Based on the SHAP plot above, we see that if a tweet contains 'vlog', 'covid', or the Tagalog words 'natin' and 'upang', it is highly likely that Bongbong Marcos tweeted it. If the tweet contains 'manila' or 'kuya bong', etc., then it is highly unlikely that Bongbong Marcos tweeted it. Other words with dominant red dots on the right side of the plot likewise indicate a high likelihood of Bongbong Marcos tweeting the text.
multi_class_viz(X, y, mc_df, 3)
Based on the SHAP plot above, we see that if a tweet contains 'senate', 'national', or 'budget', it is highly likely that Ping Lacson tweeted it. If the tweet contains 'bong', 'manila', etc., then it is highly unlikely that Ping Lacson tweeted it. Other words, mostly in Filipino, with dominant red dots on the left side of the plot likewise indicate a low likelihood of Ping Lacson tweeting the text.
multi_class_viz(X, y, mc_df, 4)
Based on the SHAP plot above, we see that if a tweet contains 'route', 'arrived', or 'city', it is highly likely that Leni Robredo tweeted it. If the tweet contains 'bong', 'upang', 'manila', etc., then it is highly unlikely that Leni Robredo tweeted it. Other words, like 'frontliners' and 'service', with dominant red dots on the right side of the plot likewise indicate a high likelihood of Leni Robredo tweeting the text.
Below are some tweets pulled from the test dataset along with their actual and predicted labels. The first example is drawn at a randomized index.
LIME is able to highlight the relevant keywords that contribute to predicting whether or not a tweet belongs to a certain candidate.
le = LabelEncoder()
y = le.fit_transform(df_all.username)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=143,
                                                    shuffle=True,
                                                    stratify=y)
# Draw the random index only after X_test exists
sample_index = np.random.randint(len(X_test))
model = LogisticRegression(C=1)
model.fit(X_train.to_numpy(), y_train)
print('Actual Label:', le.inverse_transform([y_test[sample_index]])[0])
c = make_pipeline(tfidf_vectorizer, model)
class_names = le.classes_
explainer = LimeTextExplainer(class_names = class_names)
exp = explainer.explain_instance(df_all[df_all.index == X_test.iloc[sample_index].name].text[X_test.iloc[sample_index].name],
c.predict_proba, top_labels=1,
num_features=10)
exp.show_in_notebook()
We also present three synthetic tweets below, which were not tweeted by the candidates themselves.
tweet = 'Get the jab to help the government'
exp = explainer.explain_instance(tweet,
c.predict_proba, top_labels=1,
num_features =10)
exp.show_in_notebook(text=tweet)
tweet = 'Many of our frontliners do their best in order for us to be vaccinated'
exp = explainer.explain_instance(tweet,
c.predict_proba, top_labels=1,
num_features =10)
exp.show_in_notebook(text=tweet)
tweet = 'Maraming salamat at patuloy po tayong nagpapabakuna'
exp = explainer.explain_instance(tweet,
c.predict_proba, top_labels=1,
num_features =10)
exp.show_in_notebook(text=tweet)
We also present below the confusion matrix and recall scores for the selected model, for reference. Based on the test recall results, no class shows a marked deterioration in the model's ability to discriminate.
multi_class_viz(X, y, mc_df, 0, conf_matrix=True)
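For reference, the confusion matrix and per-class recall can be computed directly; a minimal pure-Python sketch with hypothetical labels:

```python
def confusion_and_recall(y_true, y_pred, labels):
    """Confusion matrix (rows = actual, cols = predicted) and per-class
    recall = correctly predicted / actual occurrences of the class."""
    matrix = {a: {p: 0 for p in labels} for a in labels}
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    recall = {a: matrix[a][a] / max(sum(matrix[a].values()), 1)
              for a in labels}
    return matrix, recall

# Hypothetical labels for five tweets
y_true = ['leni', 'leni', 'isko', 'isko', 'bong']
y_pred = ['leni', 'isko', 'isko', 'isko', 'bong']
matrix, recall = confusion_and_recall(y_true, y_pred, ['leni', 'isko', 'bong'])
print(recall)  # -> {'leni': 0.5, 'isko': 1.0, 'bong': 1.0}
```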
This study enabled the team to profile the Twitter followers of the presidential candidates. There are some characteristics that overlap with the broader Twitter user demographic, but the distinct behaviors are as follows:
The identified business values can be summarized as follows:
As the election draws closer, more tweets reflecting voters' concerns can be collected and analyzed to provide in-depth insights into their voting preferences and considerations. Insights on controversial election-related issues such as bribery, vote-buying, ghost voting, among others, can also benefit both the candidates and voters, while also aiding in the promotion of fairness throughout the election.
Another application that was not discussed was the extension to other areas of interest. The algorithm and the pipeline are in place, and changing the set of Twitter usernames can lead to new and valuable insights. For example, tweets from anti-environmental sustainability accounts can be analyzed to determine the topics on which their followers agree, providing environmental advocates with talking points for educating the opposition.
[1] Rappler. (29 October 2021). Comelec releases tentative list of candidates for 2022 polls. https://www.rappler.com/nation/elections/comelec-release-tentative-list-candidates-2022-polls-october-29-2021.
[2] Reuters Institute. (July 2021). Reuters Institute: Digital News Report 2021. https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2021-06/Digital_News_Report_2021_FINAL.pdf.
[3] DataRePortal. (11 February 2021). Digital 2021: The Philippines. https://datareportal.com/reports/digital-2021-philippines.
[4] Forbes. (04 May 2021). How Gen-Z Is Bringing A Fresh Perspective To The World Of Work. https://www.forbes.com/sites/ashleystahl/2021/05/04/how-gen-z-is-bringing-a-fresh-perspective-to-the-world-of-work/?sh=2c6f960f10c2.
[5] Think with Google. (January 2021). Stay woke: How Gen Z is teaching us about the future of news and information. https://www.thinkwithgoogle.com/intl/en-apac/consumer-insights/consumer-trends/stay-woke-how-gen-z-teaching-us-about-future-news-and-information/.
[6] DataRePortal. (11 February 2021). Digital 2021: The Philippines. https://datareportal.com/reports/digital-2021-philippines.
As discussed briefly in the Methodology, two multi-class classification models were created to explore the best features driving our predictions. In this part of the appendix, we explore how the multi-class classification model performs when we include the different engagement metrics: retweet count, like count, reply count, and quote tweet count.
multi_class_engagement_model()
As seen in the results above, the Gradient Boosting Classifier works best in terms of accuracy. However, because of the limitations of the current SHAP implementation for multi-class tree-based classifiers, and for better interpretability through LIME, we opted to drop the engagement metrics from the feature set.